Distinct Values Estimators for Power Law Distributions
Abstract
The number of distinct values in a relation is an important statistic for database query optimization. As databases have grown in size, scalability of distinct values estimators has become extremely important, since a naïve linear scan through the data is no longer feasible. An approach that scales very well involves taking a sample of the data, and performing the estimate on the sample. Unfortunately, it has been shown that obtaining estimators with guaranteed small error bounds requires an extremely large sample size in the worst case. On the other hand, it is typically the case that the data is not worst-case, but follows some form of a Power Law or Zipfian distribution. We exploit data distribution assumptions to devise distinct-values estimators with analytic error guarantees for Zipfian distributions. Our estimators are the first to have the required number of samples depend only on the number of distinct values present, D, and not the database size, n. This allows the estimators to scale well with the size of the database, particularly if the growth is due to multiple copies of the data. In addition to theoretical analysis, we also provide experimental evidence of the effectiveness of our estimators by benchmarking their performance against previously best known heuristic and analytic estimators on both synthetic and real-world datasets.
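The sampling setting described in the abstract can be illustrated with a small sketch. This is not the paper's estimator: it only draws a sample from a synthetic Zipfian column and counts the distinct values that appear in the sample, which is a lower bound on D. The function names, the `skew` parameter, and the choice to model the relation as independent draws (sampling with replacement) are all assumptions made for illustration.

```python
import random
from itertools import accumulate


def zipf_weights(num_distinct, skew=1.0):
    """Unnormalized Zipfian weights: the value of rank i gets weight 1 / i**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_distinct + 1)]


def sample_distinct_count(num_distinct, sample_size, skew=1.0, seed=0):
    """Draw sample_size rows from a synthetic Zipfian column and count the
    distinct values seen. The result never exceeds the true number of
    distinct values, so it is a lower bound, not an estimate with guarantees."""
    rng = random.Random(seed)
    cum_weights = list(accumulate(zipf_weights(num_distinct, skew)))
    # Assumption: rows are modeled as independent draws from the distribution.
    sample = rng.choices(range(num_distinct), cum_weights=cum_weights, k=sample_size)
    return len(set(sample))


if __name__ == "__main__":
    true_distinct = 1000  # D in the abstract's notation
    for sample_size in (100, 1000, 10000):
        seen = sample_distinct_count(true_distinct, sample_size)
        print(f"sample size {sample_size}: {seen} distinct values seen (true D = {true_distinct})")
```

Because rare values in the tail of the distribution are often missed, the raw count in the sample tends to fall short of D. An estimator of the kind the abstract describes has to correct for that gap.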
Similar resources
Application of Power-Law Frequency Fractal Model for Recognition of Vertical Cu Distribution in Milloieh Porphyry Deposit, SE Iran
Identifying the vertical and horizontal distributions of elemental grades is important at different stages of mineral exploration. The main aim of this work is to determine the vertical distribution directional properties of Cu values in the Milloieh Cu porphyry deposit, Kerman (SE Iran) using the power-law frequency fractal model. This work is carried out based on four mineraliz...
Truncated Linear Minimax Estimator of a Power of the Scale Parameter in a Lower-Bounded Parameter Space
Minimax estimation problems with restricted parameter space have attracted increasing interest within the last two decades. Some authors derived minimax and admissible estimators of bounded parameters under squared error loss and scale invariant squared error loss. In some truncated estimation problems the most natural estimator to be considered is the truncated version of a classic...
On Bivariate Generalized Exponential-Power Series Class of Distributions
In this paper, we introduce a new class of bivariate distributions by compounding the bivariate generalized exponential and power-series distributions. This new class contains the bivariate generalized exponential-Poisson, bivariate generalized exponential-logarithmic, bivariate generalized exponential-binomial and bivariate generalized exponential-negative binomial distributions as specia...
Generalizing Benford's Law Using Power Laws: Application to Integer Sequences
Many distributions for first digits of integer sequences are not Benford. A simple method to derive parametric analytical extensions of Benford’s law for first digits of numerical data is proposed. Two generalized Benford distributions are considered, namely, the two-sided power Benford (TSPB) distribution, which has been introduced in Hürlimann (2003), and the new Pareto Benford (PB) distribution. ...
Generalizing Benford's Law Using Power Laws: Application to Integer Sequences
A simple method to derive parametric analytical extensions of Benford's law for first digits of numerical data is proposed. Two generalized Benford distributions are considered, namely the two-sided power Benford (TSPB) distribution, which has been introduced in Hürlimann (2003), and the new Pareto Benford (PB) distribution. Based on the minimum chi-square estimators, the fitting capabilities of ...